---
title: "What next? Modeling human behavior using smartphone usage data and (deep) recommender systems"
subtitle:
author: |
  | Simon Wiegrebe
date: "October 01, 2021"
output:
  beamer_presentation:
    # includes:
    #   in_header: head.tex
    toc: true
    slide_level: 2
    theme: "Goettingen"
    colortheme: "dolphin"
    fonttheme: "structurebold"
bibliography: bibliography.bib
biblio-style: myabbrvnat
header-includes:
  -
---

1 Motivation

Introduction

  • Smartphone usage has become a valuable source of data in recent years:
    • large volume
    • ubiquitous
    • easily accessible
    • clean
    • representative of actual human behavior
  • Behavioral researchers: investigating human behavioral traits through smartphone usage
  • Most behavioral research: association between smartphone usage patterns and pre-established personality traits
  • Here: data-centric approach to the modeling of human behavioral sequences

Research Idea

  • Smartphone usage data from a PhoneStudy project [@phonedata]

  • There is a natural sequential order in the data:

    • An app session starts by switching on the screen and ends by switching it off
    • The apps used in between, ordered by their timestamps, as well as the ON and OFF tokens form the events of an app session
  • Model behavioral sequences by means of next-event prediction

  • Large number of possible events + sequential data \(\rightarrow\) Use sequence-aware recommender system (RS) algorithms
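The session construction above can be sketched in Python (the language of all our tooling); the log schema and app names here are illustrative assumptions, not the actual PhoneStudy format.

```python
# Illustrative sketch: turn a timestamped event log into app sessions
# delimited by ON/OFF screen events (log schema is an assumption).
def build_sessions(events):
    """events: iterable of (timestamp, name) tuples; returns one token
    list per app session, including the ON and OFF tokens."""
    sessions, current = [], None
    for _, name in sorted(events):
        if name == "ON":                      # screen switched on: new session
            current = ["ON"]
        elif name == "OFF" and current is not None:
            current.append("OFF")             # screen switched off: close session
            sessions.append(current)
            current = None
        elif current is not None:
            current.append(name)              # app usage inside the session
    return sessions

log = [(1, "ON"), (2, "WhatsApp"), (3, "Chrome"), (4, "OFF"),
       (5, "ON"), (6, "OFF")]
print(build_sessions(log))
# [['ON', 'WhatsApp', 'Chrome', 'OFF'], ['ON', 'OFF']]
```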

2 Data

Description

  • PhoneStudy dataset from a mobile sensing research project [@phonedata]
  • 310 users, study period from October 29, 2017 through January 22, 2018
  • Each app usage assigned exact opening date and time

Table XXX: Excerpt of anonymized app-level data.

App-level Representation

  • In language modeling:
    • Tokens \(\widehat{=}\) words
    • Sentence \(\widehat{=}\) concatenation of tokens ending with a period
  • Here:
    • Tokens \(\widehat{=}\) apps
    • Sentences \(\widehat{=}\) sessions
  • Objective: next-app prediction
    • Predicting the next app a user is going to use in a given session
  • Mostly very short sessions

Sequence-level Representation

  • How to address the issue of short session length?
  • Focus on behavior, not individual apps
    • App-level sessions \(:=\) concatenations of app-level categories
    • These categories were pre-established by @stachl2020predicting
      • E.g.: “WhatsApp” \(\rightarrow\) “Messaging”
  • Now:
    • Tokens \(\widehat{=}\) app-level sessions
    • Sentences \(\widehat{=}\) daily concatenations of a user’s sessions
  • To avoid ambiguity, we use the terms “sequence” and “event”
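The sequence-level representation can be sketched as follows; the app-to-category mapping here is a toy stand-in for the categories of @stachl2020predicting.

```python
# Sketch of the sequence-level tokens: collapse each app-level session into
# the concatenation of its apps' categories (mapping is a toy assumption).
APP_CATEGORY = {"WhatsApp": "Messaging", "Signal": "Messaging",
                "Chrome": "Browser"}

def session_token(session):
    """One app-level session -> one sequence-level token."""
    cats = [APP_CATEGORY.get(a, "Other") for a in session
            if a not in ("ON", "OFF")]
    return "-".join(cats) if cats else "ON-OFF"   # app-less sessions

day = [["ON", "WhatsApp", "OFF"], ["ON", "OFF"],
       ["ON", "Chrome", "Signal", "OFF"]]
print([session_token(s) for s in day])            # one user's daily sequence
# ['Messaging', 'ON-OFF', 'Browser-Messaging']
```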

Summary Statistics

Table XXX: Summary statistics of app-level and sequence-level data.

  • Drawback of sequence-level analysis: data size
    • number of events at the sequence level \(\approx\) number of sequences at the app level

3 Modeling

Definitions and Terminology

  • Baseline model \(:=\) non-NN-based model
  • Session-based model: no incorporation of user-level information (user ID)
  • Session-aware model: incorporation of user-level information
  • \(s=(s_1, s_2, \dots, s_m)\): sequence of chronologically ordered events
    • \(s_m\): last “known” event in the sequence
    • \(s_{m+1}\): event we seek to predict
  • \(i\): candidate event for \(s_{m+1}\)

Session-based Baseline Models (I)

  • Association Rules (AR) and Sequential Rules (SR) [@ludewig2018evaluation]
    • are based on co-occurrence frequencies
    • only take into account \(s_m\) when making a prediction
  • AR
    • simply counts co-occurrences of \(s_m\) with \(i\)
    • normalizes this count by the number of all co-occurrences
  • SR
    • accounts for sequential event order
    • only counts co-occurrences where \(s_m\) precedes \(i\)
    • decreases the weight if other events occurred in-between
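The two rule-based baselines of @ludewig2018evaluation can be sketched as follows; the normalization and the decay function are simplified assumptions, not the exact implementation.

```python
from collections import defaultdict

# Sketch of the two rule-based baselines [@ludewig2018evaluation]:
# association-rule-style co-occurrence counting, and sequential rules
# with a (hypothetical) inverse-distance decay.
def fit_ar(sequences):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a in seq:
            for b in seq:
                if a != b:
                    counts[a][b] += 1     # co-occurrence within one sequence
    return counts

def predict_ar(counts, last_event):
    # Normalize the co-occurrence counts of the last known event.
    total = sum(counts[last_event].values())
    return {i: c / total for i, c in counts[last_event].items()}

def fit_sr(sequences, decay=lambda d: 1.0 / d):
    # Only count pairs where `a` precedes `b`; down-weight by distance.
    scores = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i, a in enumerate(seq):
            for j in range(i + 1, len(seq)):
                scores[a][seq[j]] += decay(j - i)
    return scores

train = [["ON", "WhatsApp", "Chrome", "OFF"], ["ON", "WhatsApp", "OFF"]]
print(predict_ar(fit_ar(train), "WhatsApp")["Chrome"])
# 0.2
```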

Session-based Baseline Models (II)

  • The neighborhood-based SKNN [@jannach2017recurrent]
    • defines a neighborhood of the most similar past sequences
    • determines the similarity between \(s\) and the neighbor sequences
    • computes the score of \(i\) as the sum of similarity scores across all neighbor sequences which contain \(i\)
  • STAN [@garg2019sequence] and VSTAN [@ludewig2021empirical] extend SKNN, for instance by
    • accounting for event recency in \(s\) using decay functions
    • accounting for sequence recency of neighbor sequences
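The neighborhood scoring described above can be sketched like this; the binary cosine similarity and the neighborhood size are simplifying assumptions.

```python
import math

# Sketch of a session-kNN scorer: binary cosine similarity between the
# current sequence and past sequences; candidate scores are similarity
# sums over the k nearest neighbors (details are assumptions).
def sknn_scores(current, past_sequences, k=2):
    cur = set(current)
    sims = [(len(cur & set(seq)) / math.sqrt(len(cur) * len(set(seq))), seq)
            for seq in past_sequences]
    neighbors = sorted(sims, key=lambda t: -t[0])[:k]  # k most similar
    scores = {}
    for sim, seq in neighbors:
        for event in set(seq):                         # candidate events
            scores[event] = scores.get(event, 0.0) + sim
    return scores

past = [["ON", "WhatsApp", "Chrome", "OFF"], ["ON", "Maps", "OFF"]]
print(sknn_scores(["ON", "WhatsApp"], past))
```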

Session-based Neural Models

  • GRU4Rec [@hidasi2015session]
    • initially one-hot encodes single input events
    • feeds input vectors into a Gated Recurrent Unit (GRU) layer
    • uses pairwise ranking losses for training
    • outputs, for each event, the likelihood of being next in the sequence

Session-aware Neural Models

  • HGRU4Rec [@quadrana2017personalizing]
    • is a user-aware extension of GRU4Rec
    • contains a short- and a long-term memory GRU layer
    • generates recommendations for each event in a sequence through a session-level GRU (like GRU4Rec)
    • updates an additional user-level GRU at the end of each sequence
    • employs its hidden state to initialize the session-level GRU at the beginning of the next sequence

Extensions

  • We use (a combination of) three different heuristics for some session-based algorithms
  • The extensions contribute user-level information from past sequences \(\rightarrow\) session-awareness
  1. The first prepends events from the user’s preceding sequence if \(s\) is short
  2. The second increases the score of \(i\) if \(i\) has occurred in the user’s past sequences
  3. The third adds a reminder score to the original model score
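The second heuristic above (boosting candidates the user has used before) can be sketched as follows; the boost factor is an assumption.

```python
# Sketch of the score-boosting heuristic: candidates that occurred in the
# user's past sequences get their score multiplied (factor is an assumption).
def boost_past_events(scores, user_history, boost=2.0):
    past = {event for seq in user_history for event in seq}
    return {i: s * boost if i in past else s for i, s in scores.items()}

scores = {"Chrome": 0.4, "Maps": 0.5}       # base model scores
history = [["ON", "Chrome", "OFF"]]         # the user's past sequences
print(boost_past_events(scores, history))
# {'Chrome': 0.8, 'Maps': 0.5}
```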

Implementation

  • Implementation of models and extensions based on @latifi2021session¹
  • We perform all modeling, evaluation, and analysis tasks in Python

4 Evaluation

Train-Validation-Test Split

  • Time-ordered and user-clustered data
  • Standard time-agnostic cross-validation not applicable
  • Last-event split method, applied twice:
    1. Clip off each user’s last sequence \(\rightarrow\) test set
    2. Clip off each user’s last sequence from the remaining data \(\rightarrow\) validation set
  • Each user required to have \(\ge 3\) sequences
  • Additionally: split study period into 5 equally long sub-periods (windows)
    • Apply train-validation-test split to all 5 windows
    • Average performance results across all 5 test sets
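The double last-sequence split described above can be sketched per user; the sequence contents are illustrative.

```python
# Sketch of the twice-applied last-event split: the last sequence becomes
# the test set, the second-to-last the validation set.
def split_user(sequences):
    """sequences: one user's chronologically ordered sequences (>= 3).
    Returns (train, validation, test)."""
    assert len(sequences) >= 3, "each user needs at least 3 sequences"
    return sequences[:-2], sequences[-2], sequences[-1]

user_seqs = [["ON", "A", "OFF"], ["ON", "B", "OFF"],
             ["ON", "C", "OFF"], ["ON", "D", "OFF"]]
train, val, test = split_user(user_seqs)
print(len(train), val, test)
# 2 ['ON', 'C', 'OFF'] ['ON', 'D', 'OFF']
```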

Evaluation Protocol

  • Evaluate predictions for all events in each test sequence except the first one
  • Preferable to
  • How to define the ground truth:
    • We are mostly interested in predicting the single next action of a user
    • Our definition: only the event observed at the specific position counts as the ground truth

Evaluation Metrics (I)

  • Target variable follows a multinomial distribution with large number of categories
  • We wish to quantify the goodness of our recommendation list of length k
  • We wish to perform next-event prediction with our ground truth being a single event
  • Let \(n\) be the total number of events to be predicted

Evaluation Metrics (II)

  1. Hit Rate (HR): \(HR@k\) is simply the fraction of events for which the corresponding recommendation list of length \(k\), \(rl(k)_i\), includes the ground truth, \(y_i\): \[\begin{align*} HR@k &= \frac{\sum_{i=1}^n \mathbbm{1}_{rl(k)_i}(y_i)}{n} \end{align*}\]

  2. Mean Reciprocal Rank (MRR): \(MRR@k\) additionally accounts for the ranking within the recommendation list. \(MRR@k\) computes the reciprocal rank of the ground truth within the recommendation list, \(rr_i\), then averages this reciprocal rank across all \(n\) events: \[\begin{align*} MRR@k = \frac{\sum_{i=1}^n rr_i}{n} \end{align*}\]

  • We consider \(HR@k\) and \(MRR@k\) for \(k \in \{1,5,10,20\}\)
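The two metrics above can be computed directly from the ranked recommendation lists; a minimal sketch:

```python
# Sketch of HR@k and MRR@k as defined above: `recs` holds one ranked
# recommendation list per predicted event, `truths` the ground-truth events.
def hr_at_k(recs, truths, k):
    hits = sum(1 for rl, y in zip(recs, truths) if y in rl[:k])
    return hits / len(truths)

def mrr_at_k(recs, truths, k):
    total = 0.0
    for rl, y in zip(recs, truths):
        if y in rl[:k]:
            total += 1.0 / (rl[:k].index(y) + 1)  # reciprocal rank, 0 if absent
    return total / len(truths)

recs = [["A", "B", "C"], ["B", "A", "C"]]
truths = ["B", "C"]
print(hr_at_k(recs, truths, 2))
# 0.5
```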

Tuning

  • Simple random search with a budget of 100 configurations for each algorithm
  • Hyperparameter search spaces as in @latifi2021session
  • Tuning on five-window data, then averaging performance to determine optimal hyperparameter configuration
  • Tuning metric: \(HR@1\)
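The tuning loop above amounts to the following sketch; the search space and evaluation function are hypothetical stand-ins.

```python
import random

# Sketch of the random-search tuner: sample 100 configurations from a
# (hypothetical) search space, keep the one with the best tuning metric.
def random_search(evaluate, space, budget=100, seed=0):
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(budget):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(cfg)        # e.g. HR@1 averaged over the 5 windows
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

space = {"k": [50, 100, 500], "sample_size": [500, 1000]}  # toy space
cfg, score = random_search(lambda c: -abs(c["k"] - 100), space)
print(cfg["k"])
# 100
```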

5 App-level Results

Overall Performance

Figure XXX: \(HR@k\) performance for \(k=1, 5, 10, \text{ and } 20\) on five-window app-level data.

  • Best performer i.t.o. \(HR@1\) and \(HR@5\):
  • Best performer i.t.o. \(HR@10\) and \(HR@20\):
  • Strong \(HR@1\) performance of NN-based models

Minimum Sequence Length (I)

  • Background:
    • The NN-based models employ RNNs
    • These learn from the present sequence whereas non-neural methods mostly “look up” similar sequences or app combinations
    • App-level sequences are typically short \(\rightarrow\) RNN-based methods do not have “much to learn from”
  • Hypotheses:
    • Better performance of NN-based models on longer sequences
    • No impact of sequence length on the performance of the non-NN-based models

\(\rightarrow\) Train and evaluate our models on a subset containing only sequences with at least 20 events.

Minimum Sequence Length (II)

Figure XXX: \(HR@k\) comparison between performance on full five-window app-level data (left bars) and performance on five-window app-level data when only training and evaluating on sequences with a minimum length of 20 (right bars), for \(k \in \{1,5,10,20\}\).

  • still best performer for \(HR@1\) and \(HR@5\)
  • No large changes for the non-NN-based models
  • Performance of NN-based models improves

Minimum Sequence Length (III)

  • What if instead we train on all sequences and only evaluate on long sequences?
    • still best performer
    • All neural models perform considerably worse
    • Surprising because the full training dataset is considerably larger
  • Conclusion: performance on long sequences benefits from training on long sequences only

Position in Test Sequence (I)

Figure XXX: \(HR@1\) performance across the first ten prediction positions on five-window app-level data.

  • Initial performance boost for
  • No clear trend for all other models

Position in Test Sequence (II)

  • NN-based models perform worse on later positions
  • If training is not tailored towards them, NN-based models struggle with later positions in the prediction sequences and, consequently, with long prediction sequences

Removing ON and OFF Events (I)

  • Key issue and potential performance bottleneck: short sequence length
  • ON and OFF events are hardly informative
  • ON-OFF sequences make up \(38.91\%\) of all app-level sequences
  • Effect of dropping all ON and OFF events from the app-level data?

Removing ON and OFF Events (II)

Figure XXX: \(HR@k\) performance comparison between full five-window app-level data (left bars) and five-window app-level data after dropping all ON and OFF events (right bars), for \(k \in \{1,5,10,20\}\).

  • Improvements i.t.o. \(HR@1\) across the board
  • Substantial improvements for neighborhood-based models
  • Drawback: limited representativeness of results

Category-level Prediction (I)

  • Ultimate goal: predict human behavioral sequences \(\rightarrow\) consider next-category prediction instead of next-app prediction.
  • For evaluation, simply consider app category: e.g., “messaging” instead of “WhatsApp”.
  • If performance improves considerably: models learn more about behavioral sequences than previously thought
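The category-level evaluation amounts to mapping both recommendations and ground truth to categories before checking for a hit; a sketch with a toy mapping:

```python
# Sketch of category-level evaluation: an app-level prediction counts as a
# hit if its category matches the ground truth's (mapping is a toy example).
APP_CATEGORY = {"WhatsApp": "Messaging", "Signal": "Messaging",
                "Chrome": "Browser"}

def category_hit(rec_list, truth, k):
    cats = [APP_CATEGORY.get(a, "Other") for a in rec_list[:k]]
    return APP_CATEGORY.get(truth, "Other") in cats

# App-level miss but category-level hit: Signal predicted, WhatsApp observed.
print(category_hit(["Signal", "Chrome"], "WhatsApp", k=1))
# True
```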

Category-level Prediction (II)

Figure XXX: \(HR@k\) performance increases on five-window app-level data when only considering app categories for evaluation (left bars), instead of considering the individual apps as well (right bars), for \(k \in \{1,5,10,20\}\).

  • Performance increases are especially large for larger \(k\), more pronounced for NN-based methods, and proportional to app-level performance

Embedding Analysis (I)

  • Can deep learning models learn smartphone app semantics?
  • Do apps from a common app category form clusters in the embedding space? \(\rightarrow\) Add an embedding layer (\(d=128\)) to
  • Apply t-SNE [@hinton2002stochastic] to obtain two-dimensional app embeddings

Embedding Analysis (II)

Figure XXX: App category-based clustering of app-level embeddings. Blue dots represent apps categorized as , red dots represent apps categorized as . For illustration, app embeddings are reduced to a dimensionality of two.

  • No category-level clustering recognizable
  • For only \(11.67\%\) of apps, the most similar app (i.t.o. cosine similarity) is from the same category
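The nearest-neighbor check above can be sketched with toy values; the embeddings and categories below are illustrative, not the learned 128-dimensional ones.

```python
import math

# Sketch: for each app embedding, find its most cosine-similar app and
# test whether both share a category (toy embeddings and categories).
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def same_category_share(embeddings, categories):
    hits = 0
    for app, vec in embeddings.items():
        nearest = max((a for a in embeddings if a != app),
                      key=lambda a: cosine(vec, embeddings[a]))
        hits += categories[app] == categories[nearest]
    return hits / len(embeddings)

emb = {"WhatsApp": [1.0, 0.1], "Signal": [0.9, 0.2], "Chrome": [0.0, 1.0]}
cat = {"WhatsApp": "Messaging", "Signal": "Messaging", "Chrome": "Browser"}
print(same_category_share(emb, cat))
```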

Embedding Analysis (III)

  • Alternatively: start with the data-driven clustering approach k-means
  • Look at potential accumulations of app categories within each cluster

Embedding Analysis (IV)

Figure XXX: k-means clustering of app-level embeddings (\(k=15\)). For illustration, app embeddings are reduced to a dimensionality of two.

  • Moccasin-colored cluster: 32 out of 52 apps (>60%) are camera or image editing apps
  • However: vast majority of clusters dispersed across app space, with little intra-cluster app category clustering.

Embedding Analysis (V)

  • Experimentally construct app analogies such as “Messaging 1 + Social Network 1 - Social Network 2 = ???”
  • We find no meaningful app analogies in our embeddings:
    • App analogies are conceptually much less intuitive than word analogies
    • Low overall quality of the embeddings

6 Sequence-level Results

Overall Performance

Figure XXX: \(HR@k\) performance for \(k=1, 5, 10, \text{ and } 20\) on five-window sequence-level data.

  • Strong \(HR@1\) performance by all algorithms
  • Only small performance increases with increasing \(k\)
  • and weakest performers for \(k>1\)

Removing ON-OFF Tokens (I)

  • Suspiciously high \(HR@1\) performance across all algorithms
  • High prevalence of ON-OFF tokens (\(51.06\%\))
  • All algorithms predict ON-OFF tokens (almost) everywhere
    • Predictive performance on other tokens \(\approx 0\%\)
  • Effect of removing ON and OFF events from underlying app-level data?

Removing ON-OFF Tokens (II)

Figure XXX: \(HR@k\) performance comparison for all selected algorithms between full five-window sequence-level data (left bars) and five-window sequence-level data after dropping all ON and OFF events (right bars), for \(k \in \{1,5,10,20\}\).

  • Performance drops for all algorithms, especially i.t.o. \(HR@1\)
  • best, and worst performers

Position in Test Sequence (I)

Figure XXX: \(HR@1\) performance across the first ten prediction positions on five-window sequence-level data for all selected algorithms.

  • ON and OFF events removed from the underlying app-level data
  • No clear trend for any of the models

Position in Test Sequence (II)

  • All models except perform better on later positions of the test sequences
  • The precise positioning of the cutoff not very relevant

Position in Test Sequence (III)

  • For NN-based models: performance improvement for later events in line with expectations
  • Comparison app- versus sequence-level data:
    • App-level setting: predominantly short sequences
    • Sequence-level setting: mostly long sequences
  • Corroborates our previous conclusion: differences in sequence lengths between training and evaluation data negatively affect the performance of NN-based algorithms.

7 Discussion

Conclusion (I)

  • By and large, strong predictive performance of most algorithms
  • NN-based models mostly perform well i.t.o. \(HR@1\) and \(HR@5\)
    • Amongst them, is often the weakest one
  • NN-based model performance is prone to sequence length and data size
  • NN-based models are very expensive i.t.o. runtime and computational effort
  • Simple, non-NN models are the preferable modeling choice for our data

Conclusion (II)

  • recommendable i.t.o. \(HR@1\) and \(HR@5\), no tuning

  • exhibits strong performance i.t.o. \(HR@10\) and \(HR@20\), fast

  • No overarching user-level effects in our data

    • For predicting future behavioral sequences of a particular user, not overly helpful to know this particular person’s past smartphone usage patterns
  • User-level extensions mostly effective, especially for short sequences and early positions

    • not due to some profound user-level learning
    • instead, addressing technical weaknesses of the session-based baseline algorithm
    • e.g., alleviates poor early-position performance of other neighborhood-based models stemming from low informational content in short sequences

Limitations

  • Dataset size: potentially giving a relative advantage to non-neural methods
  • Algorithm selection: not including some of the modern sophisticated approaches, e.g., BERT4Rec [@sun2019bert4rec]
    • Attention-based models require even more training data
    • Their main advantage is the better handling of long-range dependencies, while we mostly have short sequences

Suggestions for Future Research

  • Increased dataset size: new PhoneStudy dataset \(\rightarrow\) Investigate impact of data size on (NN-based) model performance
  • Information extraction: incorporation of duration, exact daytime, and geolocation of app usage
  • Transfer learning: use of pre-trained transformers?

8 References


  1. https://github.com/rn5l/session-rec/